Research on water environmental indicators prediction method based on EEMD decomposition with CNN-BiLSTM

Water resources protection is related to the development of the social economy, and the monitoring and prediction of water environmental indicators have important practical significance. Because water quality indicator data are seasonal, periodic, uncertain, and nonlinear, traditional prediction models perform poorly on them. To address this issue, this paper introduces a hybrid water quality index prediction model based on Ensemble Empirical Mode Decomposition (EEMD), combined with a Convolutional Neural Network (CNN) and a Bidirectional Long Short-Term Memory network (BiLSTM). We conducted experiments to predict dissolved oxygen based on the water quality monitoring indicators of the Liaohe National Control Sanhongcun Village station in Yichun City. The results show that the proposed model improves the $R^2$ index by 5%, 7% and 5% compared to the second-best model in the 4-h, 1-day and 2-day predictions, respectively.


Water environment indicator decomposition
EEMD was proposed by Wu et al. 19 based on Empirical Mode Decomposition (EMD) to overcome the problem of mode mixing in EMD decomposition.
EEMD adds Gaussian white noise to the original sequence, applies EMD to the noisy sequence a predefined number of times, and then averages the decomposition results to eliminate the influence of the noise. This methodology imparts properties of uniform distribution and smoothness to the original sequence. The steps for sequence decomposition in EEMD are as follows:
(i) Add white noise of limited amplitude to the original indicator sequence to obtain a new sequence:
$$X_s = X + \varepsilon_s$$
where $X$ ($X \in R^{m \times n}$) is the original sequence, $\varepsilon_s$ is white noise, and $X_s$ is the new sequence.
(ii) Decompose $X_s$ into Intrinsic Mode Function (IMF) components using EMD:
$$X_s = \sum_{l} C^{\mathrm{EMD},s}_{l} + r(t)$$
where $C^{\mathrm{EMD},s}_{l}$ is the $l$-th intrinsic mode function of the $s$-th noisy sequence and $r(t)$ is the residual.
(iii) Repeat the above steps according to the set number of times and calculate the final result as the average over all repetitions:
$$C_l = \frac{1}{S}\sum_{s=1}^{S} C^{\mathrm{EMD},s}_{l}$$
where $S$ denotes the set number of repetitions and $C_l$ the final $l$-th mode.
The process flow of EEMD decomposition for water quality indicators is illustrated in Fig. 1.
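The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the function names, the noise level, and the number of trials are assumptions, and the sifting step of real EMD is replaced by a simple moving-average split (`moving_average_split`, a hypothetical stand-in) so that the example stays short and self-contained.

```python
import math
import random

def moving_average_split(seq, width=5):
    """Stand-in decomposer (NOT real EMD): returns [detail, trend]."""
    half = width // 2
    trend = []
    for i in range(len(seq)):
        lo, hi = max(0, i - half), min(len(seq), i + half + 1)
        trend.append(sum(seq[lo:hi]) / (hi - lo))
    detail = [x - t for x, t in zip(seq, trend)]
    return [detail, trend]

def eemd(seq, decompose, trials=100, noise_std=0.1, seed=0):
    """Ensemble loop: add noise, decompose, average the components."""
    rng = random.Random(seed)
    sums = None
    for _ in range(trials):
        # Step (i): add Gaussian white noise to the original sequence.
        noisy = [x + rng.gauss(0.0, noise_std) for x in seq]
        # Step (ii): decompose the noisy sequence into components.
        comps = decompose(noisy)
        if sums is None:
            sums = [[0.0] * len(seq) for _ in comps]
        for acc, comp in zip(sums, comps):
            for i, v in enumerate(comp):
                acc[i] += v
    # Step (iii): average each component over all trials.
    return [[v / trials for v in acc] for acc in sums]

# Synthetic series: a 12-sample cycle plus a slow upward trend.
signal = [math.sin(2 * math.pi * t / 12) + 0.05 * t for t in range(48)]
components = eemd(signal, moving_average_split, trials=200)
reconstructed = [sum(vals) for vals in zip(*components)]
```

Because the stand-in split is additive, summing the averaged components recovers the original series up to the residual averaged noise, which shrinks as the number of trials grows. A real EMD routine may return a different number of IMFs per trial, so a production version would need to fix or align the number of modes before averaging.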

Local correlation feature extraction of water environment indicators
Convolutional neural networks (CNNs) are feedforward neural networks that use convolution and pooling operations for feature extraction. They are an important class of algorithms in deep learning. For time series data, 1D convolutions are often used.
In this paper, a sliding window is applied to the water environment indicator sequence to extract local features. Convolution and pooling operations provide additional noise filtering. The convolution is written as:
$$Y = w * X$$
where $w$ is the convolution kernel, $*$ denotes convolution, $X$ represents the water quality indicator sequence that has been decomposed by EEMD, and $Y$ is the extracted feature.
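As a minimal sketch of the convolution and pooling operations described above, the following Python code slides a kernel over one window of 8 values and then applies max pooling. The kernel weights, the pooling size, and the input values are illustrative assumptions; in a trained CNN the kernel is learned, and deep learning libraries compute the sliding dot product shown here without flipping the kernel.

```python
def conv1d(seq, kernel):
    """Valid-mode sliding dot product of a kernel over a sequence."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    ]

def max_pool1d(seq, size):
    """Non-overlapping max pooling; a trailing partial block is dropped."""
    return [max(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]

# One window of 8 made-up indicator values.
window = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0, 2.0, 1.0]
kernel = [0.25, 0.5, 0.25]  # hypothetical smoothing kernel

features = conv1d(window, kernel)   # [2.25, 3.25, 3.75, 4.75, 4.75, 2.75]
pooled = max_pool1d(features, 2)    # [3.25, 4.75, 4.75]
```

The smoothing kernel damps point-to-point fluctuations, and pooling keeps the strongest local response, which is one way to read the noise filtering role attributed to these two operations.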

Temporal dependence modeling of water environment indicators
This paper chooses BiLSTM to model temporal dependencies. BiLSTM is an extension of the LSTM neural network. Relevant research 20 indicates that BiLSTM offers noteworthy performance gains over LSTM in time series prediction tasks. In Eq. (1), Y represents the vector of target variables to be predicted and H represents the prediction results. BiLSTM consists of two LSTM layers that operate in opposite directions. Rather than merely stacking the two LSTM layers, it combines data features from both the forward and backward directions at the current time step for prediction.
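The bidirectional idea can be illustrated with a deliberately tiny sketch: a scalar LSTM cell run once from left to right and once from right to left, with the two hidden states paired at each time step. The weights below are hypothetical and untrained, and the hidden size of 1 is chosen only for readability; this is not the network configuration used in the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_pass(seq, p):
    """Run a scalar LSTM cell over seq; return the hidden state at each step."""
    h, c, states = 0.0, 0.0, []
    for x in seq:
        i = sigmoid(p["wi"] * x + p["ui"] * h + p["bi"])    # input gate
        f = sigmoid(p["wf"] * x + p["uf"] * h + p["bf"])    # forget gate
        o = sigmoid(p["wo"] * x + p["uo"] * h + p["bo"])    # output gate
        g = math.tanh(p["wg"] * x + p["ug"] * h + p["bg"])  # candidate value
        c = f * c + i * g
        h = o * math.tanh(c)
        states.append(h)
    return states

def bilstm(seq, p_fwd, p_bwd):
    """Pair forward and backward hidden states at every time step."""
    forward = lstm_pass(seq, p_fwd)
    backward = lstm_pass(seq[::-1], p_bwd)[::-1]
    return list(zip(forward, backward))

# Hypothetical, untrained weights for the two directions.
P_FWD = dict(wi=0.5, ui=0.1, bi=0.0, wf=0.4, uf=0.2, bf=1.0,
             wo=0.6, uo=0.1, bo=0.0, wg=0.8, ug=0.3, bg=0.0)
P_BWD = dict(wi=0.3, ui=0.2, bi=0.0, wf=0.5, uf=0.1, bf=1.0,
             wo=0.4, uo=0.2, bo=0.0, wg=0.7, ug=0.2, bg=0.0)

features = [0.2, 0.5, 0.9, 0.4]
states = bilstm(features, P_FWD, P_BWD)

# Changing only the last input leaves the first forward state untouched
# but alters the first backward state, which has seen the whole sequence.
changed = bilstm([0.2, 0.5, 0.9, 0.0], P_FWD, P_BWD)
```

The comparison at the end shows why the backward pass adds information: at the first time step the forward state depends only on the first input, while the backward state already summarises everything that follows.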

Model building
Given the strong coupling and nonlinear characteristics of water environment monitoring data, traditional prediction methods often yield subpar results. Accordingly, this paper introduces a CNN-BiLSTM hybrid model for water environment data prediction based on EEMD decomposition.
Initially, the preprocessed water environment data are decomposed by EEMD, yielding four modes. Each of these modes is subsequently fed into both CNN and BiLSTM for feature extraction. Ultimately, the extracted features are accumulated and reconstructed to derive the predictive outcome.
This hybrid model integrates EEMD, CNN, and BiLSTM to capitalize on the strengths of each component: EEMD for noise reduction, CNN for capturing local features, and BiLSTM for modeling sequential dependencies. This combination has the potential to enhance prediction accuracy. In this experiment, dissolved oxygen is decomposed by EEMD, and the resulting modes are then combined with the other indicators to form new training data. The model structure is illustrated in Fig. 2.
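One possible reading of "combined with other indicators to form new training data" is that the modes of dissolved oxygen become input columns alongside the remaining indicators. The sketch below assembles such rows; the helper name `build_training_rows`, the column naming, and all values are assumptions made for illustration, and the exact layout used in the paper is not specified in the text.

```python
def build_training_rows(modes, other_indicators):
    """Stack the decomposed modes and the other indicators column-wise."""
    names = [f"IMF{k + 1}" for k in range(len(modes))] + list(other_indicators)
    columns = list(modes) + list(other_indicators.values())
    if len({len(col) for col in columns}) != 1:
        raise ValueError("all series must have the same length")
    # One row per time step, one column per mode or indicator.
    return names, [list(row) for row in zip(*columns)]

# Made-up values: four modes of dissolved oxygen and two other indicators.
modes = [
    [0.1, -0.1, 0.2],
    [0.3, 0.2, 0.1],
    [0.0, 0.1, 0.0],
    [8.0, 8.1, 8.2],
]
others = {"TEMP": [12.0, 12.5, 13.0], "pH": [7.1, 7.2, 7.0]}

names, rows = build_training_rows(modes, others)
```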

Experiments
Dataset
The research focuses on water quality monitoring data obtained from the national monitoring station in Sanhong Village, Liaohe. Liaohe is the largest tributary of the Xiuhe River, which traverses Jing'an County in Yichun City. It is the primary river in the county and eventually merges into Poyang Lake via the Xiuhe River.
The monitoring dataset spans from November 2020 to December 2022, with measurements taken every four hours, amounting to a total of 4,700 data points. It encompasses nine indicators: water temperature (TEMP), pH, dissolved oxygen (DO), potassium permanganate (PP), ammonia nitrogen (TAN), total phosphorus (TP), total nitrogen (TN), electrical conductivity (EC), and turbidity (TUB). This dataset is obtained from the Environmental Quality Information Release Platform of Jiangxi Province.
In addition, meteorological data from Yichun City covering the same time period were gathered, encompassing six indicators: temperature, atmospheric pressure, humidity, wind speed, dew point temperature, and precipitation. These data are obtained from the website "Reliable Prognosis".
Among the various water quality indicators, the concentration of dissolved oxygen serves as a crucial benchmark for assessing water quality 21. Consequently, this paper uses dissolved oxygen as the target indicator for model prediction.
Through a series of comparative experiments, 4 was determined to be the optimal number of modes, as it gave the best accuracy during model training. In this paper, EEMD with 4 modes is therefore employed to decompose the dissolved oxygen indicator.
The waveform diagrams of each mode after decomposition in the validation and test sets are illustrated in Fig. 3. Through autocorrelation experiments, we observed that the three modes IMF1, IMF2, and IMF3 exhibit evident cyclical characteristics, while IMF4 retains the trend characteristic inherent in the data.
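A cyclical mode shows up in the sample autocorrelation as a peak at the lag equal to its period. The sketch below computes the autocorrelation of a synthetic series with a period of 6 samples, which corresponds to one day at the 4-hour sampling interval of this dataset. The series is synthetic and the function is a plain textbook estimator, not the exact procedure used in the paper.

```python
import math

def autocorrelation(seq, lag):
    """Sample autocorrelation of seq at the given lag."""
    n = len(seq)
    mean = sum(seq) / n
    var = sum((x - mean) ** 2 for x in seq)
    cov = sum((seq[t] - mean) * (seq[t + lag] - mean) for t in range(n - lag))
    return cov / var

# Synthetic daily cycle: 6 samples per day, 10 days.
series = [math.sin(2 * math.pi * t / 6) for t in range(60)]

acf_full_period = autocorrelation(series, 6)   # close to 0.9
acf_half_period = autocorrelation(series, 3)   # strongly negative
```

A trend-like mode such as IMF4 would instead show an autocorrelation that decays slowly with the lag and has no periodic peaks.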
(i) Missing and outlier value handling
During the analysis of the data, missing values and outliers were found, caused by factors such as equipment maintenance or malfunctions during data collection.
For indicators with a significant number of consecutive missing values, linear interpolation is employed to fill in the gaps according to the formula:
$$\phi(x) = y_0 + \frac{y_1 - y_0}{x_1 - x_0}(x - x_0)$$
where $x$ represents time and $\phi(x)$ represents the estimated value at that time. The coordinates $x_0$ and $y_0$ represent the first known data point, and $x_1$ and $y_1$ represent the second known data point.
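The interpolation rule can be sketched as follows, with `None` marking a missing reading and the sample index standing in for time. The function name and the sample values are illustrative; gaps at the very start or end of the series have no second known point and are left untouched in this sketch.

```python
def fill_missing(seq):
    """Fill interior None gaps by linear interpolation between known points."""
    filled = list(seq)
    known = [i for i, v in enumerate(seq) if v is not None]
    for left, right in zip(known, known[1:]):
        y0, y1 = seq[left], seq[right]
        for i in range(left + 1, right):
            # phi(x) = y0 + (y1 - y0) / (x1 - x0) * (x - x0)
            filled[i] = y0 + (y1 - y0) * (i - left) / (right - left)
    return filled

# Made-up dissolved oxygen readings with two gaps.
readings = [8.0, None, None, 9.5, 9.0, None, 8.0]
repaired = fill_missing(readings)
# repaired == [8.0, 8.5, 9.0, 9.5, 9.0, 8.5, 8.0]
```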
(ii) Normalization
As water quality indicators have different scales, each indicator is normalized for optimal model training using the formula:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
where $x$ is the original data to be normalized, $x'$ is the normalized data with value range [0, 1], and $\max(x)$ and $\min(x)$ are the maximum and minimum values in the dataset, respectively.
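A minimal sketch of this min-max scaling, applied to one indicator at a time, is given below. The handling of a constant series (mapped to zeros) is an assumption added here to avoid division by zero; the paper does not discuss that case.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] using the minimum and maximum of the series."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention for a constant series.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up water temperature readings.
scaled = min_max_normalize([10.0, 15.0, 20.0])   # [0.0, 0.5, 1.0]
```

In practice the minimum and maximum are usually taken from the training split only and reused for the validation and test splits, so that no information from later data leaks into training.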

(iii) Correlation analysis
To investigate the significance of each indicator in the prediction process, correlation analysis is conducted on the data, and a correlation heat map is presented in Fig. 4.
After EEMD decomposition, the correlations between dissolved oxygen and indicators such as temperature, electrical conductivity, ammonia nitrogen, and total nitrogen increase.
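The text does not state which correlation coefficient underlies the heat map; the sketch below assumes the Pearson coefficient and ranks candidate indicators by the absolute value of their correlation with the target. All numbers are made up for illustration, and `rank_by_correlation` is a hypothetical helper.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (std_a * std_b)

def rank_by_correlation(target, candidates):
    """Order indicator names by |correlation| with the target, strongest first."""
    scores = {name: pearson(target, series) for name, series in candidates.items()}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)

# Made-up readings: dissolved oxygen falls as water temperature rises.
dissolved_oxygen = [11.0, 10.0, 9.0, 8.0]
candidates = {
    "EC": [200.0, 210.0, 205.0, 220.0],
    "TEMP": [10.0, 15.0, 20.0, 25.0],
}

r_temp = pearson(dissolved_oxygen, candidates["TEMP"])
ranking = rank_by_correlation(dissolved_oxygen, candidates)
```

Taking the first few names of such a ranking is one way to select the most strongly correlated indicators for a reduced input set.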

Determination of model parameters
In this paper, grid search is employed to optimize the model parameters. Only one parameter is adjusted at a time while the others are held fixed, and grid search is used to fine-tune it. After iterating this procedure over the parameters, the optimized model parameters are presented in Table 1.
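The one-parameter-at-a-time search can be sketched as follows. The parameter names (`lr`, `units`), the candidate values, and the toy scoring function are all hypothetical; in a real run the score would be the validation error obtained by training the model with the trial parameters.

```python
def coordinate_grid_search(grid, score, start):
    """Tune one parameter at a time, keeping the others fixed (lower is better)."""
    best = dict(start)
    best_score = score(best)
    improved = True
    while improved:
        improved = False
        for name, candidates in grid.items():
            for value in candidates:
                trial = dict(best, **{name: value})
                trial_score = score(trial)
                if trial_score < best_score:
                    best, best_score, improved = trial, trial_score, True
    return best, best_score

def toy_validation_loss(params):
    """Hypothetical stand-in for training and validating a model."""
    return (params["lr"] - 0.01) ** 2 * 1e4 + abs(params["units"] - 64) / 64

grid = {"lr": [0.1, 0.01, 0.001], "units": [32, 64, 128]}
best, best_score = coordinate_grid_search(
    grid, toy_validation_loss, start={"lr": 0.1, "units": 32}
)
```

Sweeping parameters one at a time needs far fewer evaluations than a full grid over all combinations, at the cost of possibly missing settings that only work well jointly.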

Metrics for experimental evaluation
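Mean absolute error (MAE), mean square error (MSE), mean absolute percentage error (MAPE) and the coefficient of determination ($R^2$) are employed as quantitative metrics to assess the predictive performance of the model:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
where $y$ is the true value, $\hat{y}$ is the predicted value, $\bar{y}$ is the mean of the indicator, and $n$ is the number of samples. When comparing models, lower values of MAE, MSE, and MAPE indicate better model performance, while an $R^2$ value closer to 1 signifies a superior model.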
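For reference, the four metrics can be computed with a few lines of Python. The sample values are made up, and MAPE is expressed in percent here, which is a common convention but an assumption about the scale used in the result tables.

```python
def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; true values must be nonzero."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up dissolved oxygen values and predictions.
y_true = [8.0, 10.0, 12.0, 10.0]
y_pred = [9.0, 10.0, 11.0, 10.0]

scores = {
    "MAE": mae(y_true, y_pred),    # 0.5
    "MSE": mse(y_true, y_pred),    # 0.5
    "MAPE": mape(y_true, y_pred),  # about 5.21
    "R2": r2(y_true, y_pred),      # 0.75
}
```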

Experimental design
Dissolved oxygen is chosen as the target variable for prediction, and both single-step and multi-step predictions are carried out. Based on data correlation analysis, four combinations of data have been designed, as described in Table 2. Based on these four data combinations, four experiments are designed: a window size experiment, a model comparison, a correlation experiment, and an ablation experiment.

Experimental results and analysis
In this paper, relevant experiments are conducted in accordance with the aforementioned plan.
(i) Sliding Window Size Experiment: To determine the optimal window size, comparative experiments are performed using window sizes of 8 and 48 for XGBoost, LSTM, GRU, and our proposed model.
The experimental results indicate that each model has a low sensitivity to the window size. Taking the $R^2$ metric as an example, the XGBoost model shows only a 2% improvement in prediction results when the window size is increased to 48, whereas the other models give better prediction results with a window size of 8. Consequently, this paper adopts a window size of 8 in subsequent experiments.
(ii) Popular prediction models commonly used in the field of time series forecasting, namely XGBoost, LSTM, and GRU, are selected for comparison. In light of the widespread adoption of transformer-based models for time series prediction, the Temporal Fusion Transformer (TFT) was introduced by Bryan et al. 22 TFT is capable of learning intricate relationships between different temporal scales within time series data. Building upon this, Jitha et al. 23 leveraged the temporal fusion transformer architecture to model and predict river water quality indicators.
Additionally, Zhou et al. 24 proposed the Informer model for long-term time series prediction. Therefore, the Informer model is also included in our comparative analysis.
The comparison experiment is conducted at step sizes of 1 (4 hours), 6 (1 day), 12 (2 days), and 18 (3 days). The results are presented in Table 3, with the optimal results in bold.
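How a window size of 8 and a step size translate into training samples can be sketched as follows. The sketch assumes a univariate series and a single target located `horizon` steps after the end of the window; whether the paper predicts only that value or every intermediate step is not stated, so this layout is an assumption, as is the function name.

```python
def make_samples(series, window=8, horizon=1):
    """Pair each window of past values with the value `horizon` steps after it."""
    samples = []
    for start in range(len(series) - window - horizon + 1):
        end = start + window
        samples.append((series[start:end], series[end + horizon - 1]))
    return samples

# Index values stand in for 4-hourly readings, so targets are easy to read off.
series = list(range(30))

one_step = make_samples(series, window=8, horizon=1)   # 4 hours ahead
one_day = make_samples(series, window=8, horizon=6)    # 1 day ahead

# one_step[0] == ([0, 1, 2, 3, 4, 5, 6, 7], 8)
# one_day[0]  == ([0, 1, 2, 3, 4, 5, 6, 7], 13)
```

Larger step sizes leave fewer usable samples and place the target further from the observed window, which is consistent with the decline in accuracy at longer horizons.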
According to the results, the proposed model consistently achieves the best prediction performance at steps 1, 6 and 12 in Combination 1, with improvements in $R^2$ of 5%, 7% and 5% over the second-best model. At step 18, the model achieves the second-best result, differing by only 0.01 from the optimal value. When meteorological data are introduced (Combination 2), little improvement in prediction performance is observed for any of the models, and the $R^2$ values remain relatively consistent across different step sizes. Notably, the proposed model continues to deliver the best results at step sizes of 1, 6, and 12. At step 18, Informer performs slightly better than our proposed model, which suggests an advantage of Informer in long-term prediction.
As the prediction step size increases, the forecasting performance of various models tends to decline.However, the proposed model consistently achieves the best results across nearly all step sizes, demonstrating its efficacy in dissolved oxygen prediction.
Examining the 1-step prediction curve, it is evident that the proposed model in this paper provides a better fit to the actual values, with the curves nearly overlapping the true values.The curves are depicted in Fig. 5.
(iii) Following correlation analysis, the top 4 most strongly correlated indicators are selected and used with the proposed model for multi-step prediction. The results are presented in Table 4, with the optimal values in bold. The prediction accuracy remains relatively consistent even after indicator screening based on correlation analysis. Specifically, Combination 3 achieves the second-best $R^2$ value in 1-step prediction, while Combination 4 attains the optimal $R^2$ value in 6-step prediction.
In summary, the selection of indicators that are highly correlated with the target allows for a reduction in data dimensionality without significantly compromising the model's performance. The proposed model, when combined with these correlated indicators, continues to deliver robust multi-step dissolved oxygen forecasting. This approach enables more efficient water quality modeling by using fewer but informative variables, thereby streamlining the modeling process.
(iv) Ablation Experiment: To further substantiate the contributions of individual modules within the proposed model, corresponding ablation experiments have been devised. The results are presented in Table 5, with the optimal values in bold.
It is evident that the inclusion of the CNN module enhances prediction performance at step 1.However, its influence diminishes as the step size escalates.Conversely, the introduction of the EEMD decomposition module leads to marked improvements in prediction performance, attaining the second-best results consistently across all step sizes for both Combinations 1 and 2. This underscores that EEMD contributes more significantly towards enhancing predictions compared to the CNN module.

Discussion and conclusion
Given that water environmental monitoring data are seasonal, periodic, uncertain, and nonlinear, with intricate interdependencies among indicators, this paper introduces a hybrid CNN-BiLSTM model integrated with EEMD decomposition for water quality data prediction.
The EEMD decomposition technique is highly effective in mitigating noise interference within the data. Additionally, the four resulting modes from this decomposition process augment the data available for model training, thereby enhancing the training efficacy of the model. The incorporation of CNN enables the model to excel in extracting local features, and its integration with BiLSTM facilitates the use of bidirectional data and the acquisition of higher-level features, collectively bolstering prediction performance.
Based on prediction experiments conducted on the dissolved oxygen indicator, the proposed model demonstrates superior prediction performance compared to existing models. This constitutes a valuable exploration of the practical applications of artificial intelligence technology in the realm of water resource protection. In the future, the determination of the number of modes in EEMD, data augmentation for water quality data, and the application of Transformers in long-term water quality data prediction would be beneficial research directions.
In conclusion, the proposed hybrid deep learning approach provides an effective solution for precise multistep water quality forecasting, capable of addressing the intricate attributes of water environment data.The findings underscore the viability of harnessing advanced AI techniques to enhance environmental modeling and conservation efforts.

Figure 4. Heat map: (a) correlation between water quality indicators; (b) correlation heat map for IMF4 after EEMD decomposition.
(i) Window size experiment: Verify the impact of window size on the results.
(ii) Model comparison: Compare with the mainstream time series prediction models XGBoost, LSTM, GRU, and Informer.
(iii) Correlation experiment: Conduct multi-step comparative prediction experiments on the four data combinations.
(iv) Ablation experiment: Verify the role of each module through ablation experiments.

Table 1. Model parameters for each model. Since each model has different characteristics, the parameters that need to be set are not exactly the same. In the table, "-" indicates that the model does not need this parameter. To facilitate the comparison of model performance, shared parameters are set to the same values wherever possible.

Table 3. Experiment results of the multi-step model comparison.

Table 4. Experiment results of the correlation analysis.

Table 5. Experiment results of the ablation experiments.